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Abstract 

Non-linear manifold learning enables high-dimensional data analysis, but requires out- 
of-sample-extension methods to process new data points. In this paper, we propose a 
manifold learning algorithm based on deep learning to create an encoder, which maps a 
high-dimensional dataset and its low-dimensional embedding, and a decoder, which takes 
the embedded data back to the high-dimensional space. Stacking the encoder and decoder 
together constructs an autoencoder, which we term a diffusion net, that performs out-of- 
sample-extension as well as outlier detection. We introduce new neural net constraints for 
the encoder, which preserves the local geometry of the points, and we prove rates of con¬ 
vergence for the encoder. Also, our approach is efficient in both computational complexity 
and memory requirements, as opposed to previous methods that require storage of all train¬ 
ing points in both the high-dimensional and the low-dimensional spaces to calculate the 
out-of-sample-extension and the pre-image. 


1 Introduction 

Real world data is often high dimensional, yet concentrates near a lower dimensional mani¬ 
fold, embedded in the high dimensional ambient space. In many applications, finding a low¬ 
dimensional representation of the data is necessary to efficiently handle it and the representation 
usually reveals meaningful structures within the data. In recent years, different manifold learn¬ 
ing methods have been developed for high dimensional data analysis, which are based on the 
geometry of the data, i.e. preserving distances within local neighborhoods of the data. These in¬ 
clude kernel PC A [^, ISOMAP [^, locally linear embedding (LLE) [^, Laplacian eigenmaps [^, 
Hessian eigenmaps and Diffusion maps [^. Some of these methods are spectral methods, 
based on the eigenvectors of adjacency or affinity matrices of graphs on the data. These meth¬ 
ods are capable of capturing a smooth representation of the data and have been shown to be 
robust to noise and outliers. The ability to capture the underlying low-dimensional structure 
of data has made these methods appropriate for dimensionality reduction. Unlike classical di¬ 
mensionality reduction methods, such as principal component analysis (PCA), these methods 
are nonlinear, which is essential as real-world data typically does not lie on a hyperplane. In 
addition, they preserve local structures in the data, while disregarding distances between points 
that are far apart, which are typically meaningless in high-dimensional data. These approaches 
are very popular in machine learning, signal processing and data mining applications. 

When the data set is very large, or when processing online sequential data, it is impractical 
to directly compute an embedding for the entire dataset. The computational complexity of 
calculating the affinity matrix and the eigen-decomposition of the matrix become infeasible 
in terms of memory and running-time. Since these non-linear techniques do not provide an 
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explicit mapping from the data to the embedding, out-of-sample extension (OOSE) methods 
are used to extend the embedding to new data points [7[jl4]. In such cases, the low-dimensional 
embedding is constructed for a representative sample of the data and is then extended to all 
remaining, or new points. This is a common approach in image processing applications, for 
example, especially for high-resolution images [T^[T6] . 

Deep learning has gained popularity in the past years, achieving state-of-the-art results in 
machine learning, computer vision and speech processing applications, handling increasingly 
large datasets [T7|jT^ . Deep neural nets are capable of learning increasingly abstract repre¬ 
sentations for the data 20 , some of which are robust to small perturbations around training 


points |21[|22| . However, these representations are built globally, without incorporating local ge¬ 
ometry or density of the data. Jia et al. recently introduced Laplacian Autoencoders, which 
impose locality preserving constraints via the weighted affinity matrix. Their goal is to improve 
pre-training of autoencoders in constructing neural networks. However, their formulation lacks 
both a parameter to balance between the reconstruction loss and the affinity regularizer, and an 
embedding to compare to, so there is no way to ensure that the final representation is anywhere 
near the true manifold eigenfunctions. Because of this, there are no theoretical guarantees on 
the convergence of the algorithm or on the usefulness of the representations. 

In this paper, we propose a new approach to out-of-sample extension, applying a deep neural 
network to learn the mapping between the data and the embedding. We employ deep learning 
from a manifold learning perspective, by explicitly incorporating a manifold embedding of the 
data in the deep learning framework. We address three closely connected problems: OOSE 


of the embedding to new points, a pre-image solution 24,^ that can include regularization, 
and outlier detection on test data. This third goal is important and often neglected in OOSE 
methods. If new data is an outlier and does not fit the model of the training data, the extension 
is ill-defined and its new representation will be insufficient. 

To accomplish these three goals, we propose to train a neural network-based encoder and 
decoder, and combine them in an autoencoder. First, we train a multi-layer encoder to ap¬ 
proximate the low-dimensional embedding on the data; specifically we use diffusion maps 
for embedding the data. The encoder performs OOSE of the embedding. We then train a 
multi-layer decoder whose input is the diffusion map, to learn the inverse mapping between 
the embedding and the data. The decoder enables to recover the pre-image of the embedding, 
mapping new points in the diffusion space to the high-dimensional data space. Finally, by 
stacking the two networks, we obtain an autoencoder, termed the diffusion net. The diffusion 
net enables to perform outlier detection, indicating when the extension of a given point via the 
encoder is faulty due to its being an outlier that does not follow the model of the data. The 
diffusion net also performs denoising of the data, reconstructing a clean version of noisy data. 
Our approach is both computationally efficient and has low memory costs. Once the diffusion 
net has been trained, it is unnecessary to retain the training data and embeddings as required in 
other methods. Thus, harnessing the efficiency of deep learning networks enables to efficiently 
process large quantities of data. 

This paper is organized as follows. Section [^provides background on manifold learning and 
deep neural networks. In Sections and |^we propose a deep learning approach for manifold 
learning, enabling out-of-sample extension, pre-image computation and outlier detection. A 
proof on bounding the convergence of a single layer encoder is presented in Section We 
present experimental results in Section Future research directions are discussed in Section 
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2 Background and Related Work 


2.1 Diffusion Maps 


Diffusion maps is a popular manifold learning technique, based on the construction of the graph 
Laplacian of the data set [^. It has been used successfully in various signal processing, image 
processing and machine learning applications 10,16,26-32 . In this section, we briefly review 


the diffusion maps construction. For a detailed discussion on diffusion maps, see 

Given a high dimensional set X — a weighted graph is constructed with the data 

points as nodes and the weight of the edge connecting two nodes, A:(x^,Xj), Xi^Xj G X, as a 
measure of the similarity between the two data points. The affinity matrix K[i^j] — k{xi^ xj) is 
required to be symmetric and non-negative, where a common choice is a radial basis function 
(RBF) kernel 


k{xi^ Xj) = exp { —, 


( 1 ) 


where a > 0 is a global scale parameter. A local scale can be set for each point as in [^. In 
practice, K can be computed using only the nearest neighbors of every point and K[i^j] is set 
to zero for Xj that are not among the nearest neighbors of Xi. 

We apply a normalization of the data to obtain an approximation of the Laplace-Beltrami 
operator on the data, so the embedding will not rely on the distribution of the points [^[To). 
The kernel is normalized by the degree of each point D[i, i] — J2jex by 


3 


A random walk is then created on the normalized data set by: 

P = D-^K, D{i,i) = J2K[i,j]. ( 2 ) 

3 


The row-stochastic matrix P satisfies P[i^j] > 0 and ~ ^ therefore can be 

viewed as the transition matrix of a Markov chain on the data set X. The eigen-decomposition 
of P yields a sequence of biorthogonal left and right eigenvectors, (pi and p3i respectively, and 
a sequence of positive eigenvalues: 1 = |Ao| > |Ai| > .... Then, t steps of the Markov chain can 
be calculated as 

Pt{xi,Xj) = X\^pi{xi)(l)i{xj). 
l>0 

A diffusion distance dj:,M{xi, Xj;t) between two points Xi,Xj € X is defined by 


dDMixi,Xj;t) = 


{pt{xi,xk) -pt{xj,xk)y 

(poi^k) 


i>i 


(3) 


where (po is the stationary probability distribution on the graph. This metric is robust to noise, 
since the distance between two points depends on all possible paths of length t between the 
points. Due to the spectrum decay of P, the diffusion distance can be approximated using 
only the first d eigenvectors. Equation (© implies that a mapping can be defined between the 
original space and the eigenvectors p3£. Retaining only the first d eigenvectors, the mapping 
embeds the data set X into the Euclidean space where the diffusion distance is equal to the 
Euclidean distance in this new embedding: 

{X\ipi{xi),Xlip2{xi),...,Xa^d{xip. (4) 


Note that pjQ is not used in the embedding because it is a constant vector. In this paper we set 
t = 1, but our approach can be used for estimating the embedding for general t. 
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2.2 Out-of-sample Function Extension 


Having calculated a diffusion map on the data X, various methods have been proposed for 
extending to new points. In simple examples, there are analytic mechanisms for creating har¬ 
monic extensions when the eigenfunctions of the Laplacian can be derived analytically. However, 
this is not applicable in the general case. The Nystrom extension method is a popular method 
for general OOSE. Given a new point x' G Af \ X, the eigenvector is extended to this point 
as: 

^ m 

= ^=^p{x',Xj)'ipi{xj), £ = 1, ■■■,(!. (5) 

,=i 


Geometric Harmonics is an OOSE method which improves upon the Nystrom extension 

method. It treats both the numerical instability due to A by extending only the eigenvectors with 
significant eigenvalues. In addition, it finds an appropriate scale for the kernel in the extension, 
dependent on the function that is being extended. Rabin and Coifman proposed a Laplacian 
pyramids-based OOSE method in [^. The eigenvectors are extended in an iterative multi-scale 
scheme, where the number of scales is adapted to the complexity of each eigenvector separately. 
This approach was recently extended in to implicitly incorporate cross validation in the 


training procedure and avoid over-fitting in the training. Aizenbud, Bermanis and Averbuch 14 


introduced an extension method based on a generalized least squares solution for each new test 
point within its local neighborhood in the training set. This solution is shown to minimize 
the Mahalanobis distance between the embedding of the training points and the estimated 
embedding for the new point, in respect to a covariance matrix that incorporates geometric 
properties of the data and embedding. 

The computational complexity of these methods typically depends on the number of training 
points, since these methods rely on the distances between a new test point and all training 
points, or its nearest neighbors in the training set (determined by a nearest neighbor search 
algorithm). Therefore, to perform OOSE, it is necessary to keep all the training points and 
their corresponding embeddings in memory. If the affinity matrix X in Q is based on non¬ 
it is not possible to use nearest neighbors search. 


Euclidean local metrics, such as 31,34-37 


since each training point is associated with its own local metric. This increases the complexity 
of the distance calculation and adds to the memory requirements of the OOSE. The method we 
propose has no such requirement. After training, our approach enables OOSE without retaining 
any of the training data or embeddings. 


2.3 Deep Neural Networks 


Artificial neural networks (ANNs) are networks composed of interconnected computational units 
termed neurons^ which are typically organized in layers. A deep neural network is composed of 
multiple hidden layers, and is typically designed as a feed-forward network, in which there are no 
cycles or loops. In our framework, we use multi-layer perceptrons (MLP), which are a popular 
and important class of neural nets, in which the layers are densely connected. The output 
of each layer is computed as an affine mapping of the previous layer followed by a non-linear 
function: 


where a is termed the activation, W^^^[i^j] denotes the weight associated with the connection 
between unit j in layer I and unit i in layer / +1, is a bias vector, (j(-) is a non-linear function 
applied element-wise, and = X. We denote the number of hidden units in layer I by si 
and the overall number of layers in a network, including the input and output, by L. In our 
experiments, we used a sigmoid non-linearity in the activation: a(z) = i^\-z ■ Other choices 
include a{z) = tanh( 2 ;) and rectified linear units: a{z) = ReLU( 2 :) = max{0,2;}. 
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Deep nets have been successfully applied to various tasks such as regression for learning 
a function over a given dataset, classification, feature learning, etc., achieving state-of-the- 
art results. The task the network performs is determined by the output layer and the cost 
function minimized over the network. In supervised learning, the goal is to predict a function 
or labels on the input data. The cost function of a network consists of a loss function, and a 
weight regularization term is usually added in order to prevent over-fitting. Given a multi-layer 
network, we denote the weights and biases of all the layers hj 9 — For regression 

of a multi-dimensional function, y G M^, as in our application, the squared error can be used 
for the loss term: 


^ m 

j{e) = ^Y 

2m ^ 

1=1 




L-1 


1 ^, 


( 6 ) 


1=1 


where o{xi) G is the output of the net for input || • |||. is the Frobenius norm, /i is a cost 
parameter. We use an I 2 penalty on the weights, which is a very popular choice for regulariza¬ 
tion. Neural nets are typically trained using variants of stochastic gradient descent (SGD) for 
calculating the weight and biases in the network that minimize the cost function. The gradient 
of the loss function relative to the weights is computed efficiently using backpropagation [38] , 
starting from the output layer backwards to the input. 


2.3.1 Autoencoders 

Deep learning can also be used in an unsupervised manner, such as in training autoencoders. 
An autoencoder is composed of an encoder function /(•) and a decoder function ^(•), where the 
dimension of / is typically smaller that the dimension of the input data. The reconstruction of an 
input Xi is given by stacking the decoder on the encoder: r{xi) — g{f{xi))^ and the autoencoder 
is trained to minimize a reconstruction error loss L(x, r{x)) over the training points, i.e. trying 
to approximate the identity function. This is performed by setting the number of output units 
of the decoder to be the dimension of the input data, and fine-tuning a loss function between 
the output and the input. The output of the encoder, i.e. the middle layer of the autoencoder, 
can be seen as a low-dimensional representation of the data. Autoencoders were popularized by 
Hinton 18 . Basic autoencoders consist of a single hidden layer (^,22 , but deep autoencoders 


have also been proposed for classification, denoising, image retrieval, speech enhancement and 
more 


18,21 


3 Diffusion Net 

3.1 Problem Setup 

We assume the data lies on a smooth, compact, d-dimensional Riemannian manifold A4 embed¬ 
ded in a high-dimensional space {d < n). Calculating the diffusion map : X ^ B 
for a given training set X — {xi}^^ maps the high-dimensional space to Euclidean space 
revealing the low-dimensional structure of the data. We address three related problems in this 
setting: (a) Out-of-sample extension, (b) pre-image solution and (c) outlier detection. 

Given new test points X' = {x'} ^ Ad \ X, the purpose of an out-of-sample extension 
method is to extend to the new points. For the given training points, the extension should be 
as close as possible to the true embedding: — T(x ^)||2 < e, for small e. In addition, the 

extension should preserve the properties of the original embedding, such as preserving 

local structures in the data. 

The second problem we address is calculating the pre-image [^[^. Given a point ip in 
the diffusion space, the pre-image of is a data-point x for which = ip. The pre-image 

has been shown to be closely related to the OOSE problem [^. Calculating the pre-image is 
essential to being able to pull back calculations made in the embedding space to the data space. 
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Finally, we aim to provide a new outlier detection measure, to detects outlier in newly arrived 
data. In OOSE algorithms, the ability to provide a good embedding for new points depends on 
how well they follow the model of the training data, i.e. their distance from the training data. 
For example in kernel-based extension methods such as Nystrom, the scale of the Gaussian 
kernels limits how close points need to be to the training data. Denote by x* = argmin||x' —x^|| 

Xi^X 

the nearest-neighbor of a test point x' within the training set X. If the distance between the 
two is very large, \\x' — x*|| a, then = ^^(x*), i.e. the OOSE reverts to 1-nearest 

neighbor prediction. Numerically, if \\x' — x*||/(j ^ 0, then the affinity to the training set 
evaluates to zero, and the OOSE is the origin. In related works (e.g. [8}|TT]), there is typically 
an implicit assumption that the test data follows the distribution of the training data. However, 
the embedding obtained for outliers is uninformative and can lead to mistakes in the task the 
embedding is used for, i.e. classification, signal processing, etc. Therefore, indicating if a point 
is an outlier within an OOSE framework is important to properly processing the data. 

3.2 Our Approach 

We propose a solution to these problems based on deep neural nets, training a neural network- 
based encoder and decoder, and combining them in a deep multi-layer autoencoder. Thus, 
instead of training a general autoencoder in which the middle layer can be seen as providing 
a low-dimensional representation of the data, we incorporate the given embedding on the data 
in the training procedure. The encoder and the decoder are both MLPs composed of L layers, 
whose output layer is a regression layer. In our approach, the encoder learns a mapping from 
the manifold to the diffusion space, by minimizing the squared loss between the output of 
the encoder and T(x). In addition, we impose a new constraint to preserve properties of the 
embedding. The decoder learns the inverse mapping from the low-dimensional diffusion space 
back to the high-dimensional space of the data, solving the pre-image problem. By stacking the 
encoder and decoder we obtain an autoencoder, whose inner-most layer computes the diffusion 
embedding of the data. The autoencoder can be used for both outlier detection and denoising. 
Our framework is presented in Algorithms and 

Algorithm 1 Out-of-sample Extension 

Encoder (Section [3.3[ ) 

Training phase: 

Input Training set X, diffusion embedding T, number of layers for encoder L. 

1: Initialize the weights 9^ = of the encoder and set = X, ^ = T. 

2: Optional: pre-train every hidden layer in every stack as an autoencoder. (Section |4.1[ ) 

3: Optimize the parameters of the network by fine-tuning cost Q with back-propagation. 
(Section |4.2| ) 

Test phase: 

Input test data X' 

4: Calculate the out-of-sample-extension to new points as the output of the encoder T(x') = 
o^(x'). 


3.3 Encoder 

The encoder learns the mapping between the high-dimensional data space and the embedding 
space. The encoder is designed as an MLP, minimizing the L 2 loss between the diffusion map T 
and the output of the net, i.e. o^(x^) = T(i) in Q, where denotes the output matrix of the 
encoder for all training points, G To facilitate the output of the encoder approximating 

the diffusion map, we add a new constraint to the training objective of the encoder. Since the 
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Algorithm 2 Pre-Image 

Decoder (Section [O] ) 

Training phase: 

Input Training set X, diffusion embedding T, number of layers for decoder L. 

1 : Initialize the weights 9^ = of the decoder and set = T, ^ = X. 

2 : Optional: pre-train every hidden layer in every stack as an autoencoder. 

3: Optimize the parameters of the network by fine-tuning cost with back-propagation. 

Test phase: 

Input test points in the embedding space {cp} 

4: Calculate the pre-image of the test points as the output of the decoder x = o^{ip). 


Algorithm 3 Autoencoder 

Autoencoder (Section [3~^ 

Training phase: 

1 : Stack decoder trained as in Alg. [^on top of encoder trained as in Alg. [^and obtain autoen¬ 
coder with 2 L — 1 layers. 

2 : Calculate average reconstruction error e. 

Test phase: 

Input test data X' 

3: Calculate reconstruction of the point as the output of the autoencoder x' — r{x'). 

4: Calculate an outlier detection score as threshold on the reconstruction cost \\r{x') —x'\\ > Ce. 


coordinates of the diffusion map are eigenvectors of the random walk matrix on the data,P 
in (§, we add a new cost term on the output of the encoder being an eigenvector of this matrix. 
The additional term, termed the eigenvector (EV) constraint, is 


jEV 


2m 


(7) 


where is the j-th row of the output matrix, Xj is the j-th eigenvalue of the affinity matrix, 
and T] is an optimization cost parameter. 

This new term imposes smooth locality on the training points, and maintains the local 
geometry of the points in the encoder function. In contrast to general constraints in deep 
learning, this constraint “mixes” output values of different data points together, via the random- 
walk matrix. The EV constraint enforces that an output value for a given data point can be 
reconstructed as a weighted average of the outputs of the local neighborhood of the data point. 
The locality of the neighborhood depends on the locality of the kernel used to define the affinity 
matrix 0- This constraint serves to minimize the loss error on the output, such that the 
output is not a general regression of the desired function, but an eigenvector of the matrix, 
thus maintaining the geometric properties of the embedding. Note that for a high value of 77 , 
this constraint forces the output to the origin, which is a trivial solution to minimizing this 
constraint. 

Although we use diffusion maps for manifold learning, other kernel methods may be used and 
imposed with the EV constraint. This only requires replacing the matrix P with the matrix that 
is decomposed in other approaches, for example the normalized affinity matrix 
as in spectral clustering or the unnormalized graph Laplacian matrix L = D — X as in 
Laplacian Eigenmaps [^. 

Adding the EV constraint to the cost function of the encoder, the total cost is: 


^ m 

r{e^) = — y 

2 m ^ 

i=l 




L-1 


+ 2 


1=1 




(P-AV)(o?) 


e^T||2 

P 


( 8 ) 
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where 9^ = denotes the set of the weights and biases of all the layers of the encoder. 


To incorporate this constraint in the gradient calculation in back-propagation, the gradient of 
this constraint in regards to the output layer is: 


d 


^ E - V) - V) ■ 


For new data points, the encoder performs out-of-sample extension, based on the mapping 
learned from the training points. This extension is computationally efficient as it relies on a 


few affine transforms and non-linear element-wise functions. In addition, once trained, it does 


not depend at all on the training data, making it efficient also in terms of memory. 

3.4 Decoder 

The decoder has the same architecture as the encoder, only in reverse. The input layer is 


a(^) = T, with size si = d, and y — X with size sl — n. There are no new constraints in the 
decoder cost, and it consists of a loss term and weight regularization term as in Q: 




m 



(9) 


where where 9^ — denotes the set of the weights and biases of all the layers of the 

decoder, and o^{i) G is the output for input Note that the decoder weights 

are not tied to those of the encoder therefore the number of units in the hidden layers 

can be different than for the encoder. Enforcing tied weights in the autoencoder will be explored 
in future work. 


The decoder solves the pre-image problem and has several interesting applications. As the 


decoder learns the non-linear inverse mapping between the data and the diffusion space, it 
enables to output points in the data space given new points in the diffusion space. This enables 


data visualization, performed by covering the diffusion space with new points, and inputting 
these points to the decoder. This also enables to perform data augmentation for increasing the 
training set in machine learning applications. Another benefit of the decoder is the ability to 
perform calculations in the embedding space and then pull back to the data-space, for example, 
calculating centroids in a clustering application, or interpolation of points in the embedding 
space. In our current formulation we do not impose constraints on the decoder, however, in 
future work we will explore including a PDE regularizer in the cost function, such as a harmonic 
constraint. This will enable recovering interesting surfaces, such as “minimal surfaces” from the 
embedding. 

3.5 Autoencoder 

Having trained the encoder and the decoder, the two networks are stacked together to obtain 
an autoencoder. This architecture is displayed in Figure The network is composed from two 
stacks, one for the encoder and one for the decoder. In most of our experiments each stack has 
2 hidden layers, and in some we use simpler stacks of a single hidden layer. 

One application of the autoencoder is to use the autoencoder reconstruction error to detect 
outliers in the data. Denoting the output of the autoencoder by r, the training data provides 
an average reconstruction error: 
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Figure 1: Autoencoder. Left: encoder with two hidden layers, Right: decoder with two hidden 
layers. The output of the encoder is a prediction of the diffusion map. The output of the 
decoder is a reconstruction of the data. 


This enables setting a threshold on new test points to determine whether they fit the model of 
the data learned by the autoencoder. If the reconstruction error for a new point x' is greater 
than this threshold, the point is considered an outlier and its OOSE can be disregarded. This 
condition is given by 

\\r{x) — x\\ > Ce, 


where (7 is a constant determined by the user and the training data. Performing outlier detection 
within the framework of OOSE was previously addressed in [^, where a Mahalanobis distance 
in the embedding space was used for anomaly detection. Our approach differs in that our 
measure depends on the reconstruction of the point in the data space and not in the embedding 
space. 

A second application of the autoencoder is denoising. When applied to noisy data, the 
diffusion map recovers a smooth manifold. Noise in the data, which relates to the relaxation time 
in the diffusion process, is typically manifested in eigenvectors relating to small eigenvalues [^ . 
These are are usually disregarded when choosing the number of dimensions to retain in the 
embedding. Thus, noisy points which relate to the same original clean point are embedded at 
identical coordinates, and this mapping is learned by the encoder. The decoder, which learns 
the inverse mapping from the embedding to the data, can only recover a smooth version of 
the data. Variations in the recovered data are such that they are along the low-dimensional 
manifold, whereas variations due to noise, that are not seen by the manifold, are suppressed. 
Thus, inputting new noisy points into the autoencoder, outputs a clean version of the points. 
Vincent et al. 21 provide an interpretation of their denoising autoencoder based on the manifold 
assumption, however they do not explicitly incorporate the manifold in their training procedure. 
In addition, they are intentionally corrupting the data by adding noise to the input and then 
training the network to recover the clean data. We do not assume the data is clean, but that 
the embedding provides a smooth representation of the data. 


4 Practical Considerations 

This section provides implementation details regarding the optimization of the diffusion net and 
the computational complexity of our approach. 
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4.1 Pre-training 


In general neural nets, a greedy layer-wise pre-training procedure was first developed by Hin¬ 
ton, Osindero and Teh for Restricted Boltzman Machines, and later extended to autoen¬ 
coders 19 . This procedure was shown to improve the optimization of the neural network. In this 
approach, each layer is first trained in an unsupervised manner as an autoencoder whose input is 
the output of the previous layer. This enables to “initialize” the parameters 9 = of 

the network before performing backpropagation, termed fine-tuning, over the complete network 
for the supervised learning task. It was shown empirically that this procedure yields improved 
results compared to initializing with random weights which can get stuck in poor local minimum 
solutions. 

Recently, deep networks trained using very large quantities of labeled data have achieved 
successful results, with pre-training having very little impact, if at all. It appears that employing 
large amounts of data in the training cancels the advantage gained by pre-training [^. In our 
case, the amount of the data used to train the network is not large, therefore pre-training is 
beneficial. 

We pre-train every hidden layer as a denoising autoencoder with a sparsity term. This 
term encourages the average activation of every hidden unit to be small so that the hidden 
representation of the data is sparse [40|. The loss function is given by 


^ m 2 Si 

^ E11^* - ^*11' + f E +p^mp\\pj), (11) 

i=i ^ 1=1 j=i 

where 9 — The input point is x is x corrupted by setting a random 

subset of entries to zero and Xi is the reconstruction of Xi by the autoencoder. The third term 
is the sparsity loss where where si is the number of hidden units, p is a sparsity parameter, 
typically a small value close to zero, pj is the average activation of hidden unit j and KL(p||pj) 
is the Kullback-Leibler divergence between the probability mass functions of Bernoulli random 
variables with parameters p and pj. This constraint imposes that each hidden units “responds” 
only to specific input patterns in the data. 


4.2 Optimiztion 

The cost function in neural networks is highly non-convex. However, convex optimization 
methods are used, as typically a local minimum yields good results. We perform training 
with Limited memory BFGS (L-BFGS) with line search procedure, a quasi-Newton 

method implemented in Matlab’s minFunc package, and the gradients are computed using 
standard back-propagation [^. L-BFGS is a suitable approach when the dataset is not too 
large, and even out-performs SGD, which is a more popular approach in deep learning, in certain 
settings [io] . 

The weights and biases in all layers are initialized with random values from a normal distri¬ 
bution with zero mean and small variance. The weight regularization parameter p was set to 
be small, i.e. order of 10“^ or 10“^^. In the experimental results in Sectionwe examine the 
performance of the new constraint 0 for different values of 77 . 

4.3 Mini-batch training 

Mini-batch training is typically used with large datasets in deep learning and has been shown to 
improve performance. In this method, a random subset of samples is chosen for each iteration 
(or number of iterations) and the gradients are computed and averaged only for this small 
subset. Since the constraint we added 0 entails multiplying the output of the encoder by 
the normalized random walk matrix, it is essentially “mixing” between the outputs of different 
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training points. This is problematic in mini-batch training, since a point’s local neighborhood 
which most influences the eigenvector constraint will not necessarily be included in the mini¬ 
batch. If extracting only a subset of points, a new matrix needs to be calculated for the mini¬ 
batch by extracting the relevant rows and columns, and renormalizing the new matrix. In this 
case, this constraint is reduced to the output being only an approximation of the eigenvector of 
the full random walk matrix. However, since the size of training data used in manifold learning 
is usually not large due to computational complexity constraints, it is unnecessary to train using 
mini-batches. In this our setting differs from typical large-scale applications in deep learning 
such as Imagenet, CIFAR-100, etc. 

4.4 Computational complexity 

Training is an iterative sequential process, performed offline. Once the encoder has been trained, 
out-of-sample extension on new data is calculated with complexity where si — n 

and sl = d. In our experiments, we used an encoder with two hidden layers so the complexity 
was of order 0{ns2 + ^ 2^3 -h s^d). This is opposed to other OOSE methods in which the 
complexity depends on m, the number of training points, where typically, m ^ n > d. For 
example, the complexity of OOSE using Nystrom is (9((d-hn)m). 

An advantage of our method is that there is no need to retain the training points and 
embedding, once the network has been trained. Only the weight matrices and bias vectors of 
all the layers of the network are necessary. Thus, our approach requires memory on the order 
of 0{J2i=i Other methods, on the other hand, require retaining all training points 

and embeddings. For Nystrom and Geometric Harmonics this results in memory on order 
of 0{{d + n)m). The memory cost of the PCA-based approach in is higher, requiring an 
additional 0{md^) to save covariance matrices for the embeddings of the training points. If 
using a non-Euclidean metric as in [^[^, retaining the covariance matrices of the training 
points leads to an additional memory cost on the order of 0{min?). Thus, our method has a 
large advantage in applications and systems in which the memory and run-time are limited. 


5 Bounding the Out-of-sample-extension Error 

In this section, we provide a theoretical bound on the error rate for approximating eigenfunctions 
of the Laplacian using a single layer network of sigmoid hidden units. The full derivation is 
given in 1^ Suppose Ad is a smooth compact d-dimensional submanifold of Assume Ad 
is equipped with a metric p : Ad x Ad ^ which is locally bi-Lipschitz with respect to the 
Euclidean metric, meaning 35, e such that 

(1 - e)\\x - y \\2 < p{x,y) < (1 + e)\\x - y\\ 2 , for ||x - y\\ < 6. 


Theorem 5.1. Let M he a smooth eompaet submanifold of ML equipped with a metrie whieh is 
loeally bi-Lipsehitz with respeet to the Euelidean metrie. Let he a Euelidean hall of radius r 
sueh that Ad C Let be an eigenfunetion of the Laplaeian of M with eigenvalue \, and let 
f :M^ ^M be an extension of the eigenfunetion to the ambient dimension via 

f{x) = exp{—X\\x — PMx\\^)f^{PMx)-> where Pmx = argmin^^;\a \\x — y\\ 2 . (12) 

Then there is a linear eombination fx of K sigmoidal units sueh that 


/ 


{fix) - fK{x)fdx ) < -E. 


(13) 
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Corollary 5.2. Under the same eonditions as Theorem \5.1\ let be eigenfunetions of 

the Laplaeian and fi be the extension of ipi. Let f{x) = (/i(x),/^(x)). Then there exists a 
single hidden layer network with Kd sigmoidal units and output o(x) G sueh that 





(14) 


Note that Theorem |5 .1 1 guarantees existence of a solution, yet since the optimization problem 
is non convex, it does not guarantee convergence. 


6 Experimental Results 

In this section, we present experimental results for several toy problems and real image data. 
We demonstrate the performance of the encoder and decoder separately and then joining them 
in an autoencoder. We evaluate the effect of the eigenvector constraint and demonstrate how it 
improves the performance of our system, especially in noisy scenarios. Finally, the autoencoder 
is successfully used in outlier detection in images. This demonstrates that our solution not 
only performs OOSE, but also means by which to verify that new data agrees with the model 
inferred from the training data. 

6.1 Encoder 

Our first example is a closed 3-D curve parametrized by: 

a;* = (cos(0j),sin(26'j),sin(30j))^ , 0je(O,27r). (15) 

We examine the effect of adding the eigenvector constraint 0 to the encoder cost function 0 
for this toy problem. We address both the effect of the architecture of the network, i.e. number 
of layers and units, and the effect of noise in the data. We add white Gaussian noise Vi with 
standard deviation (std.) — 0.05 to the data: 

Xi = Xi + Vi, V 

and sample 2000 points from this noisy curve. The 2D diffusion map of these points is a smooth 
circle (Fig. [^. 


0.5 

0 

- 0.5 


-1 - 0.5 0 0.5 

Figure 2: (a) 3D closed curve, (b) First two coordinates of the diffusion map. Points colored 
by Oi. 

Since there is a great deal of randomness in the optimization, we trained encoders for 10 
realizations of the data, and averaged the MSE over the realizations. The encoder MSE for a 
single realization of the data is calculated as 

^ m 

i=l 
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Figure 3: (a) MSE of the encoder for the training data, comparing 1 hidden layer (red ’x’) vs. 
2 hidden layers (blue circle) for varying rj values, (b) MSE of the encoder for the training data, 
comparing various r] values. 


Figure l^a) plots the MSE for various values of 77 for an encoder with 1 hidden layer and with 2 
hidden layers. In Fig. we plot several results from this experiment. The top row displays the 
output of the encoder with one hidden layer and the bottom row with two hidden layers. Column 
(a) is the result without the added constraint. Columns (b)-(d) are for 77 = 10, 10^, 10^ 

respectively. 

A first conclusion from these simulations is that adding depth to the network improves the 
results. This follows known observations in deep learning, that functions that can be compactly 
represented by networks with L layers, can require an exponential number of units in a network 
with L — 1 layers [^. Using one hidden layer, the reconstruction of the embedding is noisy 
and the added constraint is not enough. Using two hidden layers, the results are significantly 
improved and the eigenvector constraint serves to smooth out the noise in the reconstruction. 
In terms of the MSE, adding a layer decreases the error by a factor of 10. When the cost 
parameter is too large as in Fig. Qd), this constraint collapses the output of the encoder to 
the origin, which is the trivial solution to minimizing this constraint. The optimal results were 
achieved for 77 = 10 — 100. Not imposing the EV constraint (77 = 0) yielded an error that was 
twice as high and results in a noisy approximation. 

To examine the effect of noise on the approximation, we ran simulations with varying values 
of ajy. The encoder was trained with 2 hidden layers, with 20 hidden units in each layer. We 
averaged the training MSE over 20 realizations of the data, where for each realization several 
encoders were trained, one for each value of 77 in the EV constraint, displayed in Fig. [^b). The 
best results are consistently obtained by 77 = 100, regardless of the level of noise. Even for 
no noise (cr^y = 0), including the EV constraint with proper 77 is meaningful for estimating the 
diffusion embedding (77 = 0 has second or third highest error for all values). If is 77 is too 
high, the solution collapses to zero, resulting in a high error. For low cr^y, 77 = 1000 performs 
well so there is some trade-off between the SNR and how high 77 can be. 

To examine the dependence of the mean error on the number of units in the single layer 
encoder, we trained the network with a varying number of hidden units si G {5,..., 60}, for 
77 = 100, given clean data, i.e. cr^y = 0. Figure shows that the empirical mean error e is 
bounded by so that empirically we are achieving a tighter bound than the bound guaranteed 
in Corollary |5.2[ 

Regarding the parameters used in the simulations, the number of units in the hidden layers 
in the simulations was si = 20, unless stated otherwise. The cost parameter of the weight 
regularization term was set as /i = 10“^^. For the sparsity constraint in pre-training the auto¬ 
encoder 0 , we tried various values of /3 for p = 0.1 and they do not affect the errors in any 
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Figure 4: Top row: Encoder with 1 hidden layer, Bottom row: Encoder with 2 hidden layers. 
Blue plot is original diffusion map, red plot is output of the encoder. The columns examine 
the effect of the eigenvector constraint for increasing values of r]. (a) 77 = 0 , (b) 77 = 10 ,(c) 
T] = 10^, (d) r] = 10^. If T] is too large, the eigenvector constraint dominates the cost function 
and the output collapses to the trivial solution that all points equal zero. For a one hidden 
layer encoder, the eigenvector constraint is not enough to smooth out the noise and obtain a 
good embedding. For a two layer net, increasing 77 within a reasonable range smooths the noise 
and yields an improved output compared to not including this constraint. 


significant manner. Therefore, we set /3 = 1. 

6.2 Decoder 

The decoder solves the pre-image problem. It can be used to extend data from the diffusion 
embedding space to the data space, thus creating new points that lie on the manifold. This 
enables a better visualization of the data belonging to the manifold, in the data space. 

Our first example is based on the clean version of the 3D curve, given in ( [T^ . We draw 2000 
random samples from this curve. The diffusion map is calculated with a = O.I for the scale 
in the kernel 0 - The first two diffusion coordinates are a circle, as in the previous example. 
We train a decoder using these 2000 samples and their embedding. We then cover the diffusion 
space with points enclosed within a circle extended beyond the radius of the embedding and use 
the decoder to predict the data in the 3D data space. Figure [^top-left and bottom-left display 
the points in the embedding space colored by angle and radius, respectively. The remaining 
three columns display different views of the predicted points in the 3D data space, where the 
colors correspond to the color of the points in the diffusion space. We can see the points are 
restricted to a 2D surface enclosed within the 3D curve. Note that the embedding of the origin is 
handled smoothly, with no singularities in the data space. Also note that points in the diffusion 
space which are located beyond the original circle, are decoded in the data space along the 
boundary of the surface. Thus, the range of extension is limited and follows the geometry of 
the original data. 

The next example is of more complex data, whose embedding is also a 2D circle. Given an 
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Figure 5: Embedding estimation error vs. number of hidden units si in 1-layer encoder (solid 
line). The empirical error is bounded by ^ (dashed line), which is a tighter bound than the 
theoretical bound guaranteed in Corollary 


5.2 


image of a noisy periodic function (std. of the noise is 25), shown in Fig. [^a), we extract all 
overlapping patches sized 5x5 pixels, and construct a random walk on the patches, to obtain 
the diffusion map Figure [^b) displays examples of 49 patches extracted from the image. 
The first two coordinates of the diffusion embedding are a circle shown in Fig.j^c). Training the 
a decoder with two hidden layers for the diffusion map and the patches, yields a function that 
“produces” image patches. The input dimension is si = 2 and the output dimension is 5^ = 25. 
The decoder enables visualization of the data. Figure [^a) displays the patches obtained by 
inserting several test points from the diffusion space into the decoder, where the position of 
the patches in the circle corresponds to the location of the points entered into the decoder. 
This example shows that the radius in the diffusion embedding represents the amplitude of the 
periodic function in the image patches. At the origin of the diffusion space, you get a smooth 
patch. Note the patches are clean as opposed to the training patches. 

Figure l^b) shows how this can be used for image manipulation. A rotation of tt and scale 
by 0.45 is applied to the diffusion map as 

-i)*- 

The values of 4^ are inputted into the decoder. Then the output values are used to reconstruct 
the image where each value is assigned to its original pixel location. The resulting image is 
indeed a shift of the original image in Figure and the amplitude has been decreased. 


6.3 Autoencoder 


We train an autoencoder for the noisy 3D data described in Sec. |6.1[ First, we train a decoder 
from the embedding to the noisy data, using either one or two hidden layers, corresponding to 
what was used in the encoder. We then stack the decoder on top of the trained encoders, and 
obtain an autoencoder. We sample 1000 new test points from the noisy curve, and calculate 
their reconstruction via the autoencoders. These training and test phases were repeated for 10 
realizations of the data. For each realization of the data, we train several encoders for different 
values of r] and one decoder. Thus, the MSE of the autoencoder depends on r] and not on the 
optimization of the decoder, i.e. a lower cost is due to the encoder since all autoencoders share 
the same decoder. We then average the reconstruction MSE over all realizations, for each r] 
separately. The reconstruction error is given by 


^ m 

' = -Ei 

m 


- r{xi 


1=1 
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Figure 6 : Left column: the diffusion space is covered in points within a circle. The black points 
are the diffusion coordinates for the training data and are the input points to the decoder. The 
remaining three columns are different views of the output of the decoder, where the black points 
are the original training datapoints. Top row: points colored by angle in the diffusion space. 
Bottom row: points colored by radius. 


Note that we are comparing the output of the autoencoder to the clean data although the 

autoencoder is trained using the noisy data {xi)i. This demonstrates the denoising capabilities 
of the network. Figure]^ a) summarizes the results for various values of 77 for autoencoder with 
1 hidden layer, and with 2 hidden layers. Adding a layer decreases the reconstruction MSE by a 
factor close to 10. We can see that in all simulations, the training and test reconstruction errors 
in the data space are of the same order, so that the algorithm does not over-fit the training 
data. In addition, using 2 layers in the network improves both encoder and autoencoder errors. 
In addition, as previously shown, if 77 is too large, for example 77 = 10^, this corrupts both the 
embedding and the data. The best overall results were achieved for 77 = 100. 

To examine the effect of noise on the diffusion net, we trained the autoencoder for varying 
values of The network was trained with 2 hidden layers in the encoder, and 2 hidden layers 
in the decoder, with 20 hidden units in each layer. Following the experimental results of the 
encoder, we set 77 = 100 in the EV constraint. We average the training and test reconstruction 
error over 20 realizations of the data, displayed in Fig.j^b). The reconstruction error is between 
the reconstructed output and the original clean data, demonstrating the denoising capabilities 
of the diffusion net for increasing noise std. As in the previous example, the training and test 
errors are very similar, implying that the training does not over-fit the data, both in the encoder 
and in the decoder, even in increasingly noisy scenarios. 

6.4 Outlier Detection 

We now apply the autoencoder to real image data and demonstrate that the autoencoder per¬ 
forms outlier detection. As stated in Sec. test data will not necessarily follow the model of 
the data used to calculate the embedding. OOSE applied to new data that significantly differs 
from the training data will assign embedding values that do not distinguish it from the data. 
Therefore, it is important to be able to determine when the embedding is unreliable. 
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Figure 7: (a) Image of noisy periodic function, (b) Example of 49 training patches extracted 
from the image, (c) Diffusion map of training image. 



Figure 8: (a) Decoding image patches. Various points along and within the circle are inputted 
to the decoder. We display the obtained patches at the locations in the diffusion space that were 
used an input to the decoder. This demonstrates that the radius of the embedding corresponds 
to the amplitude of the periodic function in the image space, (b) Image manipulation: the 
training diffusion map is rotated by 180 degrees and the values multiplied by 0.45. Inputting 
these points into the decoder and reconstructing the image from output, shows the period of 
the image has indeed shifted and the amplitude decreased. 


Figures [l()[ a) and [Tl[a) display two images of patterned semiconductor wafers acquired by 
a scanning electron microscope, sized 200 x 200 pixels. Both wafer images have a defect near 
the center of the image. For each image separately, we randomly extract 2500 patches from 
the image, sized 8x8 pixels. This training set is used to calculate a diffusion map, reducing 
the data from dimensionality n = 64 to d = 2 dimensions. The training set is used to train an 
autoencoder, as in Algorithm]^ and the average training reconstruction error e (10) is calculated. 
We then input all overlapping image patches from the image into the autoencoder and calculate 
the reconstruction error of each patch. Figures [To|(b) andpT|(b) display \\r{x')—x'\\/e for all pixels 
in the image, revealing that this approach easily separates the defects from the background. 

This is a result of the diffusion map capturing the main structure of the data, i.e. the pattern 
of the wafer, as represented in the training data. Patches which differ from the training set, as 
in the case of the defects, are not represented in the diffusion map. Thus, when applying the 
autoencoder, these patches are not properly reconstructed by the mappings learned from the 
data space to the diffusion space and back. This result obtained by the autoencoder indicates 
that, for these patches, the embedding provided by the encoder does not properly represent 
them. 
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Figure 9: (a) MSE of data reconstruction via the autoencoder, comparing 1 hidden layer (red 
’x’) vs. 2 hidden layers (blue circle) for varying r] values. Dashed line is training data, solid 
line is test data, (b) MSE of data reconstruction vs the std. of the noise. Reconstruction is 
performed via the autoencoder for r] — 100 in the EV constraint. Solid line is training error, 
dashed line is test error. 
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(a) 

Figure 10: (a) SEM image of a semiconductor wafer with defect near the middle of the image, 
(b) Reconstruction error of the image relative to the average training error: \\r{x') — x'\\/e. This 
reveals the wafer defect. 

6.5 OOSE Error vs. Variation in In-sample Embedding 

The purpose of an out-of-sample extension algorithm in the manifold learning setting is to 
provide an extension of the embedding to new points, such that this extension on the new 
points is close to the embedding of these points, if the embedding was calculated over all the 
points. Due to the discrete nature of the data, there is no “true” embedding; rather the value 
of the eigenvectors for a given training point depends on the other points in the set and the 
scale used in the affinity matrix. If the eigenvalues of the decomposed matrix have a geometric 
multiplicity, than the eigenvectors belonging to the same eigenvalue span a subspace, and for 
the same data they are similar up to a rotation. In our experiments in Sec. |6.1[ the data is closed 
curve, so that its continuous equivalent is the heat equation with Neumann boundary condition. 
Thus continuous solutions to this PDE are the trigonometric functions sin(-),cos(-). For our 
discrete 3D data, the eigenvectors approximate these continuous periodic eigenfunctions. The 
first two eigenvectors belong to the same eigenvalue and form a 2D circle in the embedding 
space. 

To demonstrate the performance of our OOSE approach, we conducted an experiment similar 
to that proposed in [^. We took m — 2000 training points along the curve and calculated the 
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Figure 11: (a) SEM image of a semiconductor wafer with defect near the middle of the image, 
(b) Reconstruction error of the image relative to the average training error: ||r(x') — x'\\/e. This 
reveals the wafer defect. 


diffusion embedding T for these points. We then added ntest points to the training points, and 
calculated am embedding T for all m + ntest points. Since the embedding in both cases is a 
circle, we calculated the rotation between both embeddings using the shared points, i.e. the 
training points. This was calculated by 

S[i,j] = k e {1,...,2000} i,j = 1,2. 

k 

The SVD of S is C/AE^. Then the rotation from T to T is given by 

R = VU^. 


We then calculated the error 

= II’ ^ ^ •••’ 2000}- 

We calculated this error over 10 realization of the data and for 4 values. This error is 
the variation in the in-sample embedding due to increasing the number of points for which 
the embedding is calculated. Figure [T^ compares this error to the error achieved by our best 
encoder, trained using m = 2000 points and r] = 100. For ntest > 100, the out-of-sample 
extension error of our encoder is lower than the variation in the in-sample embeddings. Since, 
OOSE will typically be performed for ntest much higher than 100, this demonstrates that our 
method provides a good out-of-sample extension. 

7 Conclusions and Future Work 

We have presented a new framework employing deep learning for manifold learning. We 
proposed designing an encoder and decoder that learn the mapping between a given high¬ 
dimensional dataset and low-dimensional embedding, and vice-versa. To this end, we proposed 
a new constraint in training the encoder, which preserves the locality of the points in the 
embedding. We demonstrated empirically that this constraint improves the approximation of 
the embedding. Our encoder enables very efficient out-of-sample extension of the non-linear 
embeddings to new points, with low memory costs. The decoder provides a solution to the 
pre-image problem, enabling data visualization and augmentation. Finally, stacking the two 
networks together as a deep autoencoder enables both denoising and outlier detection of the 
data, as seen via the embedding. Calculating the reconstruction error of the autoencoder for 
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Figure 12: Dashed plot is solid line is DN encoder training error. 


new points allows to evaluate whether the OOSE provided by the encoder properly represents 
these new points. We presented experimental results in noisy scenarios for simulated and real 
data, demonstrating the properties of the proposed architecture. 

Our main focus has been on the encoder for performing out-of-sample extension for data 
whose distribution follows the distribution of the training data. However, in different appli¬ 
cations, such as sequential signal processing, the nature of the data can change over time. In 
manifold learning, the embedding is usually calculated once for training points, and does not 
adapt over time for new points, as opposed to online dictionary learning in sparse representa¬ 
tions, for example. This could lead to the embedding not providing a good representation of 
the data as it evolves, and requires re-calculating the embedding again and again. In future 
work, we propose to develop a method based on online fine-tuning of the autoencoder that will 
adapt the embedding to new points which do not fit the model of the training data. Instead of 
performing “regular” fine-tuning of the autoencoder, constraints can be added that will main¬ 
tain the middle layer as an approximation of the embedding, as we proposed with the encoder 
in this work. In this case, we will fine-tune with both the test and training data, where the 
training data regularizes the autoencoder so that its middle layer remains an approximation 
of the embedding for the training points. By fine-tuning the network so that it reconstructs 
the new test data, the middle layer should recover a new embedding for the test data. This 
adaptive approach will be explored in future work. 

A second direction is to further explore the decoder and how including different regulariza¬ 
tions affects the solution of the pre-image problem. Including a harmonic constraint for example 
should enable recovering a minimal surface as the example shown in Sec. 6.2 The error rate 


we provided on the encoder does not apply to the decoder as it requires the function that is 
being approximated to be band-limited, which does not hold for the decoder in a general case. 
In future work on the decoder, we intend to provide a theoretical analysis of the decoder, and 
to expand our theoretical results to multi-layer nets. Computing the pre-image is important 
in different applications in which interpolating the data by averaging in the high-dimensional 
data space is meaningless, such as the possibility of performing image texture synthesis. We 
will analyze datasets in which the high-dimensional data is more complicated and examine how 
this affects the required complexity of the decoder architecture. 

A third research direction is to examine improving deep learning applications. Our network 
enables to determine the number of nodes needed to learn the geometry of the data and can be 
used to infer the maximal number of nodes needed to model the complexity of the system for an 
unconstrained neural net. In addition, we intend to explore whether incorporating our encoder 
into a deep network will improve deep neural networks. This is motivated by previous works 
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that have shown that implicitly incorporating the manifold assumption in the construction of 
deep networks improves classification results. Therefore, we expect that explicitly including the 
embedding in the network via the encoder should be beneficial. 
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A Proof of Theorem 5.1 


Proof. The proof of the theorem requires two key results in the literature, relying on theorems 
by Barron and Coifman and Lafon [^. 

Theorem A.l. (Theorem 1 from J^ ) Let f : ^ M a funetion with bounded first moment 

of its Fourier transform 


j 


\f{uj)\\uj\dw < Cf. 


Let Br C be a Euelidean ball around zero with radius r, whieh we assume eontains our data. 
Then for every n G N there exists a linear eombination fx of K sigmoidal units sueh that 


I 


(fix) - fK{x)fdx ) < 


(16) 


where Ci — 27vCf ja{Br), and fi is the Lebesuge measure on . 


Now it suffices to show that the extension function f of from (12) has a bounded first 
moment of its Fourier transform. To show this, we rely on a second result. 

Theorem A.2. (Proposition 11 from Let M. be a submanifold in MP and :T be an 
eigenfunetion of its Laplaeian with eigenvalue A. Let 5 > f) be an approximation level. Let f 
be an extension funetion as in (12). Then there exists a band limited funetion b : ^ M with 

band sueh that 


-h{x)fdx 


< 5. 


lRn{f{x)rdx 

We need several more intermediate claims before addressing the result. 
Claim A.3. The funetion f from (12) is in L‘^{]SP). 

Proof of Claim A. 4 Since p is locally bi-Lipschitz, 

[ f{xfdx = f 


< 


= a,- 


< 


[ ( [ fp{x'fdx' 

JM \JR^-d ) 

f fj{x')‘^dx' 

JM 

j 'ip{z)‘^dp{z) 

^ ^ JM 


Cn-^ 

1 - 
Cn—d,X 
1 — e 
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Claim A. 4. Let b be as in Theorem A.2, and b be its Fourier Transform. Then b G 
and 


\b{(jj)\\(jj\d(jj < oo. 


Proof of Claim \A.Ji\ Clearly 6 G L 




Since 


Because b is band limited with band Cs^'ipX^ meaning supp(6) is contained inside a ball of 
radius Cs^'ipX^ then 


f \b{(jj)\\(jj\d(jj = f \b{Ljj)\\(jj\d(jj 


< fj {b{uj))‘^d(jj^ (J \ujfd(jj 


< 


{b{x))‘^dxj [ f \(jj\‘^d(jj 

i>+^ii/iilO 1 / 

V-'BOw» 






(l + \I 'C+ 2 


lnil{Bl) n+2 


where /i(Si) is the volume of a ball of radius 1 in 


Now we prove Theorem 5.1, By setting 5 = i in Theorem 


results to show 


A.2 


□ 


we combine the above 


jif{x) - fKix))‘^dx^ < (fix) - b{x))'^dx^ + {b{x) - fK{x)fdx 


< 


C 


where 


C = 


2 +Cl 
2 + ‘2'K 


< yi + ‘ 

< I 1 + : 



/nuiBi) 

n+2 \ 

/ n + 2 
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